-
Notifications
You must be signed in to change notification settings - Fork 100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Using version uuid
in datachain pull
#621
Conversation
Deploying datachain-documentation with Cloudflare Pages
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #621 +/- ##
==========================================
- Coverage 87.73% 87.71% -0.03%
==========================================
Files 112 112
Lines 10675 10694 +19
Branches 1437 1439 +2
==========================================
+ Hits 9366 9380 +14
- Misses 950 954 +4
- Partials 359 360 +1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
@@ -170,7 +174,7 @@ def check_for_status(self) -> None: | |||
Checks are done every PULL_DATASET_CHECK_STATUS_INTERVAL seconds | |||
""" | |||
export_status_response = self.studio_client.dataset_export_status( | |||
self.dataset_name, self.dataset_version | |||
self.remote_ds_name, self.remote_ds_version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Q] Why not use uuid
instead of remote_ds_name
+ remote_ds_version
& local_ds_name
+ local_ds_version
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, there are other APIs that use name
+ version
as well, but I would leave that to followup issue not to bloat this one too much.
In general, user does use explicit name and version when initiating pull, e.g datachain pull ds://dogs@v3
but the use of uuid
in the process after initial get dataset info would fix potential issue of someone renaming remote dataset right in the middle of pull process (not sue how likely is this to happen but it's still the potential risk)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None of this needs to go into this PR but I wanted to mention it and this seems like an ok place.
I would advocate for hiding local/remote name/version from users as much as possible. (From experience) I think the concept will only confuse users. Only exposing it when we absolutely need to makes sense to me
For example:
If the user tries to pull cat@v1
under the following circumstances we should throw an error stating there is a dataset with that name already and the uuid doesn't match. Ask them to rename the local dataset (as SaaS is the source of truth):
name | version | uuid | local | remote |
---|---|---|---|---|
cat | 1 | cf308a6b-0e66-477f-ba76-9c3f8ffd27fe | ✓ | X |
cat | 1 | d1c1ac59-5763-4069-a328-8292ec6a6205 | X | ✓ |
I think we should update the output of ls datasets
to have the same format as the above or at least give it the ability to show these discrepancies somehow.
I am now second-guessing myself and wondering whether the local/remote name and version are that bad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's how it works in this PR:
- If local dataset with same
uuid
exists -> skip pulling and printing "Local copy of dataset cats@v1 already present". If it has different name locally it says "Local copy of dataset cats@v1 already present as dataset other_cats@v5" - If we don't find local dataset with same
uuid
and there is already local dataset name + version present that holds completely different data we throw error with message "Local dataset cats@v1 already exists, please choose different local dataset name or version"
I didn't want to mention uuid
anywhere as I'm not sure yet if that's something we should bother user with, but maybe I could enhance message from 2) by mentioning it ... "
EDIT: Changed it to "Local dataset cats@v1 already exists with different uuid, please choose different local dataset name or version"
src/datachain/catalog/catalog.py
Outdated
remote_ds_version = remote_ds.get_version(version) | ||
except (DatasetVersionNotFoundError, StopIteration) as exc: | ||
raise DataChainError( | ||
f"Dataset {remote_ds_name} doesn't have version {version}" " on server" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[C] Should be a single string
@@ -400,6 +400,18 @@ def get_parser() -> ArgumentParser: # noqa: PLR0915 | |||
"--edatachain-file", | |||
help="Use a different filename for the resulting .edatachain file", | |||
) | |||
parse_pull.add_argument( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[C] I still don't know if it is better to expose these or uuid to the user
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should let client work with name / version. uuid
is not really practical, and also user cannot really set it.
In this PR we are changing the way to find if the same local dataset already exists when pulling remote dataset from Studio or not. Before we were using
name
+version
combination which is not stable (as commented by Matt here) as someone could have local dataset with the same name and version as remote one, but with totally different data which would lead to not pulling the remote one. Also, someone could pull remote dataset and then rename it which would lead to pulling remote dataset again on another pull.Now, instead of
name
+version
combination we useuuid
which is uniquely identifying specific dataset (version).If local dataset version with same
uuid
as remote one already exists, regardless of thename
orversion
mismatch, it is not pulled again.Also, optional parameters
local-name
andlocal-version
are added to CLI of dataset pull to specify the name / version of local dataset to which we want to pull remote one. This is needed in the case if local dataset with same name + version already exists with different data, but we still want to pull remote one - in that case use can enter alternative name or version or both.